[Model] add colqwen2_vl code & inference #14291
BloomBerry wants to merge 7 commits into vllm-project:main
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Signed-off-by: BloomBerry <jyjang1090@gmail.com>
|
Thanks for implementing this! Can you update the following files as well?
|
|
Hey @BloomBerry, I'm working on reviving this PR since it has drifted away from the refactors on main and needs some more testing. Would you like me to push to this PR myself, or should I start a new one? It seems to require this Transformers PR: huggingface/transformers#35778 |
|
This pull request has merge conflicts that must be resolved before it can be merged. |
|
Hi, is there an estimation when the PR will be merged? |
|
Is anyone working on this? |
|
I was able to serve ColQwen 2.5 VL 3B (https://huggingface.co/Metric-AI/ColQwen2.5-3b-multilingual-v1.0) with vLLM by making some modifications to the source code. The idea is to use Qwen 2.5 VL with the ALL pooling type so that it outputs all embedding vectors for late interaction. Here is a git patch you can apply to the vLLM source code (tested with v0.11.0). I am using it with the local weights of the model. You just need to change the architecture name in the config.json and add a modules.json file:

```json
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  }
]
```

I am running the OpenAI-compatible server in Docker Compose as follows:

```yaml
entrypoint: ["vllm", "serve"]
command:
  - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
  - "--host"
  - "0.0.0.0"
  - "--port"
  - "8000"
  - "--runner"
  - "pooling"
  - "--convert"
  - "embed"
  - "--dtype"
  - "bfloat16"
  - "--max-model-len"
  - "1024"
  - "--gpu-memory-utilization"
  - "0.8"
  - "--trust-remote-code"
  - "--quantization"
  - "bitsandbytes"
  - "--override-pooler-config"
  - '{"pooling_type":"ALL","normalize":true}'
  - "--served-model-name"
  - "anyname"
```

It is working well with high throughput on an 8GB GPU. Hope it helps. |
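For reference, a minimal sketch of a text-embedding request against a server started with the configuration above. The endpoint path `/v1/embeddings` and the model name `anyname` come from that configuration; the host and port are assumptions you should adjust to your deployment.

```python
import json
from urllib import request

def build_text_embedding_payload(text: str, model: str = "anyname") -> dict:
    """Build the /v1/embeddings request body for a plain-text query.

    With pooling_type "ALL", the server returns one vector per input token
    (a multi-vector embedding) rather than a single pooled vector.
    """
    return {"model": model, "input": text, "encoding_format": "float"}

payload = build_text_embedding_payload("What is late interaction?")

# Uncomment to send the request once the server is running:
# req = request.Request(
#     "http://localhost:8000/v1/embeddings",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     result = json.load(resp)
```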
|
Does your patch support multimodal (image) embedding ? |
@HoangTung-Vu Yes indeed. You should follow the same query structure as colpali-engine. However, you cannot use the OpenAI client code, because it does not support multimodal embedding. |
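To illustrate why the plain OpenAI client falls short here: vLLM's pooling server accepts chat-style `messages` (with `image_url` content parts) on the embeddings endpoint for multimodal inputs, which the standard client does not expose. A hedged sketch of such a request body follows; field names mirror vLLM's documented multimodal embedding examples, but verify them against your vLLM version, and the prompt text is only a placeholder.

```python
def build_image_embedding_payload(image_url: str, model: str = "anyname") -> dict:
    """Build a multimodal /v1/embeddings request body (chat-style messages)."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Describe the image."},
            ],
        }],
        "encoding_format": "float",
    }

# image_url can be an http(s) URL or a base64 data URL such as
# "data:image/png;base64,...".
payload = build_image_embedding_payload("data:image/png;base64,...")
```

Post this body directly with `requests` or `urllib` instead of going through the OpenAI client.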
|
I already used requests directly instead of the OpenAI client code, but I encountered a 400 Bad Request error. If I comment out the image part, it works. |
|
@HoangTung-Vu I need more context to understand why it happened to you. Could you tell me exactly the steps you did, and the whole message error? |
|
I applied your patch using Git commands, but it raised some errors, so I manually integrated the changes instead. For the model, I cloned OpenGVLab/colqwen2_5-3b-base, added the modules.json file as in your implementation, and updated the model class in config.json. However, when sending a request to the model, I still receive a 400 Bad Request response. |
|
@HoangTung-Vu Make sure that vllm is loading the correct model. It happened to me that it loaded a default model because it could not load the local one. I did install the docker version for my specific hardware so it was faster. Here is my docker compose for an RTX 3070: |
|
I ran my tests on a cloud instance from Vast.ai. Since it is a virtual container environment, I was not able to use Docker Compose as in your setup. For the model (ColQwen), I cloned it directly from Hugging Face. I chose the base model so that I could edit the model_class field in config.json. The fine-tuned variants only include adapter configurations, so they were not suitable for this purpose. When running vLLM, I pointed directly to the local model directory, so I assume it correctly loaded the intended model. Regarding vLLM itself, I installed it from source using: I suspect that the 400 Bad Request error might be caused by an incorrect configuration of the ColQwen model on my side. I’ll review the model setup again to ensure it matches your patch specifications. |
|
@HoangTung-Vu |
|
I have rechecked the configuration and reinstalled everything. |
|
What does your base64 URL look like? Make sure it is in the correct format. |
|
@HoangTung-Vu |
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
|
Hi @BloomBerry, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the pre-commit hooks will run automatically before each commit.
|
|
Documentation preview: https://vllm--14291.org.readthedocs.build/en/14291/ |
|
This pull request has merge conflicts that must be resolved before it can be merged. |
@issahammoud I successfully ran your patch on vLLM. However, when I call v1/embeddings, whether for text or images, the returned result is always a single 128-dimensional vector. Why is this? Thank you. |
|
@TaoZC1996 |
|
Closing as stale — this has had unresolved merge conflicts and failing checks for a long time. Feel free to open a fresh PR if you'd like to revisit. Thanks for the contribution! |
Add support for ColQwen2VL model
Description
This PR adds support for the ColQwen2VL model to vLLM. ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
Key implementation details:
Extended the existing Qwen2VL implementation for ColQwen2VL compatibility
Implemented custom text projection layer and L2 normalization for embedding generation
Added appropriate processing utilities for image and video inputs
Overrode forward, compute_logits and sample methods to optimize for embedding output
This implementation enables users to leverage ColQwen2VL's multimodal document retrieval capabilities through vLLM's efficient serving infrastructure.
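The multi-vector embeddings this model produces are scored with ColPali-style late interaction (MaxSim) on the retrieval side. That scoring is not part of this PR, but a pure-Python sketch clarifies what the per-token embeddings are for:

```python
def maxsim_score(query_embs, doc_embs):
    """ColPali-style MaxSim late interaction.

    For each query token vector, take the best (maximum) dot-product match
    among the document's token vectors, then sum over query tokens.
    query_embs, doc_embs: lists of equal-length float vectors.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)
```

Because the embeddings are L2-normalized, each dot product is a cosine similarity, so scores are comparable across documents of different lengths in query-token terms.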
Testing
Tested with sample image inputs
Verified embedding output format and dimensions
Confirmed compatibility with HuggingFace ColQwen2VL models
FIX #19381